Intro to Bookworm

Motivation

Infinite Jest is a very long and complicated novel. There are a lot of brilliant resources connected to the book, which aim to help the reader stay afloat amongst the chaos of David Foster Wallace's obscure language, interwoven timelines and narratives, and the sprawling networks of characters. The Infinite Jest Wiki, for example, is insanely well documented and I'd recommend it to anyone reading the book.
One of the most interesting resources I found while reading was Sam Potts' Infinite Jest Diagram.

I went back to the image once or twice while I was reading IJ to work out who a character was and how they were connected to the scene. It's a fun resource to have access to while reading something so deliberately scattered.
However, Infinite Jest isn't the only "big" book out there, and as far as I know the network above was drawn up entirely by hand. I thought it would be nice to have something like this for anything I was reading. It might also function as an interesting learning resource - either for kids at an early-reader stage with simple books and small character networks, or for people learning about network analysis who have never bothered reading Les Misérables (again, as far as I know all of the standard example graph datasets like Les Mis and Zachary's Karate Club were put together entirely by hand).
I suspected that, with a bit of thought and testing, this process could be automated, and it can. I can now feed bookworm any novel and have it churn out a pretty network like the one above in seconds, without any prior knowledge of the story or its characters. Because connections are measured by counting how often characters appear together, it can also tell you the relative strength of every link between characters.

Getting Started

Before we start, let's import all of the code in the bookworm module. I'll explain what each function does as we move through the notebook - we'll be covering most of build_network.py here.


In [1]:
from bookworm import *

The first thing we'll do is load in a book and a list of its characters. These operations are both pretty simple. The book is loaded in as one long string from a .txt file. Character lists are stored in a .csv, with all of the potential names for one character on each row. They're loaded in as a list of tuples, one tuple of names per character.


In [2]:
book = load_book('data/raw/ij.txt', lower=True)
characters = load_characters('data/raw/characters_ij.csv')
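
For a rough idea of what those two loaders are doing, here's a minimal sketch. It isn't bookworm's actual code: sketch_load_book() and sketch_load_characters() are hypothetical names, and the details (a lower-casing flag, one character per .csv row) are assumptions based on the description above. In-memory strings stand in for the real files.

```python
import csv
import io


def sketch_load_book(text_file, lower=False):
    """Read a whole book from a file-like object as one long string."""
    book = text_file.read()
    return book.lower() if lower else book


def sketch_load_characters(csv_file):
    """Read one character per row; each row holds every name they go by."""
    return [tuple(name for name in row if name)
            for row in csv.reader(csv_file) if row]


# In-memory stand-ins for the .txt and .csv files (made-up contents)
book = sketch_load_book(io.StringIO("Hal sat down. Orin called Hal."), lower=True)
characters = sketch_load_characters(io.StringIO("hal\norin\nthe moms,avril\n"))

print(book)
print(characters)
```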

Then we split the book down into sections. Bookworm works by looking for cooccurrence of characters in these sections of the text as a proxy for their connectedness. It's a very simple trick which works stupidly well.
There are a few ways we can break down the book into sections:

  • get_sentence_sequences() uses NLTK's standard .tokenize() function to split the book into sentences.
  • get_word_sequences() uses NLTK's word_tokenize() to split the book into words, of which it will then select ordered lists of length n (default 40).
  • get_character_sequences() uses python builtins to split it into substrings of length n (default 200).

Fundamentally, they all return a list of strings which each cover a very small section of the novel. For simplicity's sake we're going to use the sentence-wise splitter.


In [3]:
sequences = get_sentence_sequences(book)
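
To make the other two splitters a bit more concrete, here's a sketch of the windowing idea. The function names are hypothetical, str.split() is a crude stand-in for NLTK's word_tokenize(), and I'm assuming non-overlapping windows, which may not match what bookworm does internally.

```python
def character_windows(book, n=200):
    """Cut the text into consecutive substrings of n characters."""
    return [book[i:i + n] for i in range(0, len(book), n)]


def word_windows(book, n=40):
    """Cut the text into consecutive runs of n words.

    str.split() is a simple stand-in for a proper word tokenizer.
    """
    words = book.split()
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]


text = "hal sat down. orin called hal. mario filmed it all."
char_sequences = character_windows(text, n=20)
word_sequences = word_windows(text, n=4)

print(char_sequences)
print(word_sequences)
```

Either way, the result is the same kind of object as the sentence splitter's output: a list of short strings that can be searched for character names.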

Now comes the interesting bit. We've assembled our cast, and moved the text that they inhabit into a nice, machine-interpretable format.
What we want to generate now is the blank table below which describes the presence of a character in a sentence. At this point, Bookworm hasn't really 'read' any of the text so all of the interactions between characters and sentences (where each cell in the table represents an interaction) are set to 0:

             character 1  character 2  character 3
sentence 1             0            0            0
sentence 2             0            0            0
sentence 3             0            0            0
sentence 4             0            0            0

The first part of find_connections() sets up the blank table above.


In [4]:
df = find_connections(sequences, characters)

Next, it iterates through the list of sentences it has been fed, checking for an instance of each character. If it finds a character in the sentence, it marks their presence with a 1.
So if character 1 appears with character 2 in sentence 1, and with character 3 in sentence 2, we would see the following, with the rest of the cells remaining 0:

             character 1  character 2  character 3
sentence 1             1            1            0
sentence 2             1            0            1
sentence 3             0            0            0
sentence 4             0            0            0
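
Here's a small sketch of that marking step, using plain lists rather than a DataFrame. presence_table() is a hypothetical name, the sentences are made up, and plain substring matching is my assumption about how a character gets "found" in a sentence.

```python
def presence_table(sequences, characters):
    """Return one row per sequence, with one 0/1 column per character."""
    table = []
    for sequence in sequences:
        # A character is present if any of their names appears in the text
        row = [int(any(name in sequence for name in names))
               for names in characters]
        table.append(row)
    return table


sequences = [
    "hal and orin talked.",
    "hal saw mario.",
    "nobody was there.",
    "the court was empty.",
]
characters = [("hal",), ("orin",), ("mario",)]

table = presence_table(sequences, characters)
for row in table:
    print(row)
```

With these four toy sentences the rows come out in the same pattern as the table above.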

In the next stage, we count how often characters cooccur with one another. Since the table has sentences as rows and characters as columns, we can compute this very quickly by taking the dot product of the table's transpose with the table itself.


In [5]:
cooccurence = calculate_cooccurence(df)

calculate_cooccurence() does this computation and then wipes out any interaction of a character with themselves. For the table above, this would give us:

              character 1  character 2  character 3
character 1             0            1            1
character 2             1            0            0
character 3             1            0            0

showing that character 1 has interacted with character 2 and character 3, but character 2 and character 3 haven't interacted. Note the symmetry across the diagonal...
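
The worked example can be reproduced in a few lines of NumPy. This is a sketch of the calculation rather than the body of calculate_cooccurence(); the variable names are mine.

```python
import numpy as np

# Rows are sentences 1 to 4, columns are characters 1 to 3
presence = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
])

# Entry (i, j) counts the sentences in which characters i and j both appear
counts = presence.T @ presence

# The diagonal holds each character's own appearance count, so zero it out
np.fill_diagonal(counts, 0)

print(counts)
```

Multiplying in this order gives a characters-by-characters result; the other order would give a sentences-by-sentences matrix instead.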

The cooccurrence matrix we're referring to here is also known as an adjacency matrix - I might use the terms interchangeably from here on.

The example table above is minuscule in comparison to the dozens of characters who might turn up in a reasonably sized novel, and the hundreds or thousands of opportunities they have to interact with one another. The cooccurrence matrix in reality is likely to contain much larger numbers between characters who regularly appear in the same sentences. Unless we're working with a really tiny, incestuous network, this cooccurrence matrix is also probably going to be pretty sparse. For that reason it'll often make sense to store it as a sparse matrix:


In [6]:
# Note: DataFrame.to_sparse() was removed in pandas 1.0, so this cell needs an older pandas
cooccurence = cooccurence.to_sparse()
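
To see why sparse storage pays off, here's a toy illustration that keeps only the non-zero cells. It isn't how pandas stores sparse data internally, and to_edge_weights() is a hypothetical helper; it just shows how little of the matrix actually needs remembering.

```python
def to_edge_weights(matrix, names):
    """Keep only the non-zero cells above the diagonal as {(a, b): weight}."""
    edges = {}
    for i in range(len(names)):
        # The matrix is symmetric, so the upper triangle is enough
        for j in range(i + 1, len(names)):
            if matrix[i][j]:
                edges[(names[i], names[j])] = matrix[i][j]
    return edges


names = ["character 1", "character 2", "character 3"]
matrix = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]

edges = to_edge_weights(matrix, names)
print(edges)
```

Nine cells shrink to two entries here, and the saving grows quickly as the cast gets bigger.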

That's the essence of what bookworm does, and everything from here onwards is just play. It really is that simple. Once we have an adjacency matrix of our characters, all of the graph theory falls into place.
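
As one small example of the graph theory on offer, summing a row of the adjacency matrix gives a character's weighted degree: the total number of cooccurrences they have with everyone else. weighted_degree() is a hypothetical helper, shown here on the toy matrix from earlier rather than on real bookworm output.

```python
def weighted_degree(matrix, names):
    """Sum each character's row: total cooccurrences with everyone else."""
    return {name: sum(row) for name, row in zip(names, matrix)}


names = ["character 1", "character 2", "character 3"]
matrix = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]

degrees = weighted_degree(matrix, names)
most_connected = max(degrees, key=degrees.get)

print(degrees)
print(most_connected)
```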

So, now we can show off a few results! Despite describing a set of tiny matrices above, we've really been computing all of Infinite Jest's massiveness while working through the notebook.

We can print the strongest relationships for a chosen character using the function below:


In [7]:
def print_five_closest(character):
    print('-'*len(str(character))
          + '\n' + str(character) + '\n'
          + '-'*len(str(character)))
    
    top_five = (cooccurence[str(character)]
                .sort_values(ascending=False)
                .index.values
                [:5])
    
    for name in top_five:
        print(name)

Applying this to 5 characters at random:


In [8]:
from random import randint

for i in range(5):
    # randint() includes both endpoints, so stop at the last valid index
    print_five_closest(characters[randint(0, len(characters) - 1)])
    print()


------------
('joubet ',)
------------
('marathe ', 'remy ')
('desjardins ',)
('zoltan csikzentmihalyi ',)
('fdv ', 'harde ', 'fall down very ')
('gavin diehl ', 'gavin ', 'diehl ')

----------------------------------------------------
('guillaume duplessis ', 'guillaume ', 'duplessis ')
----------------------------------------------------
('marathe ', 'remy ')
('steeply ', 'hugh ')
('fortier ',)
('luria perec ', 'luria p ')
('zoltan csikzentmihalyi ',)

-------------------------------------
('the moms ', 'avril ', 'mondragon ')
-------------------------------------
('hal ',)
('orin ',)
('mario ',)
('himself ', 'mad stork ', 'jim icandenza ', 'james incandenza ')
('joelle ', 'van dyne ', 'lucille ')

-------------
('dymphna ',)
-------------
('petropolis khan ', 'petropolis ')
('zoltan csikzentmihalyi ',)
('evan ingersoll ', 'ingersoll ')
('gately ', 'don ')
('fully functional phil ',)

------------------------
('dean of admissions ',)
------------------------
('zoltan csikzentmihalyi ',)
('dolores epps ',)
('gavin diehl ', 'gavin ', 'diehl ')
('gately ', 'don ')
('fully functional phil ',)

Those all seem to make sense... Let's try with a few characters who we know about in more detail.


In [9]:
print_five_closest(('the moms ', 'avril ', 'mondragon '))


-------------------------------------
('the moms ', 'avril ', 'mondragon ')
-------------------------------------
('hal ',)
('orin ',)
('mario ',)
('himself ', 'mad stork ', 'jim icandenza ', 'james incandenza ')
('joelle ', 'van dyne ', 'lucille ')

In [10]:
print_five_closest(('joelle ', 'van dyne ', 'lucille '))


------------------------------------
('joelle ', 'van dyne ', 'lucille ')
------------------------------------
('orin ',)
('gately ', 'don ')
('the moms ', 'avril ', 'mondragon ')
('erdedy ',)
('himself ', 'mad stork ', 'jim icandenza ', 'james incandenza ')

In [11]:
print_five_closest(('pemulis ',))


-------------
('pemulis ',)
-------------
('hal ',)
('trevor "axhandle" axford ', 'axford ', 'axhandle ')
('jim troeltsch ', 'troeltsch ')
('james struck ', 'struck ')
('keith freer ', 'freer ', 'the viking ')

In [12]:
print_five_closest(('bruce green ',))


-----------------
('bruce green ',)
-----------------
('randy ', 'lenz ')
('gately ', 'don ')
('himself ', 'mad stork ', 'jim icandenza ', 'james incandenza ')
('kate gompert ', 'gompert ')
('tommy doocey ',)

Yep... Compare the results we've generated to the ones in the diagram at the top of the notebook.

Same code, different book

Let's run the whole thing for an entirely different book and see whether we get similarly positive results. This time, Harry Potter and The Philosopher's Stone - chosen because you're more likely to have some contextual knowledge of who's who and what's what in that book.


In [13]:
book = load_book('data/raw/hp_philosophers_stone.txt', lower=True)
characters = load_characters('data/raw/characters_hp.csv')
sequences = get_sentence_sequences(book)

df = find_connections(sequences, characters)
cooccurence = calculate_cooccurence(df).to_sparse()

In [14]:
characters[:5]


Out[14]:
[('vernon ', ' dursley '),
 ('petunia ', ' dursley '),
 ('dudley ', ' duddy '),
 ('lily ',),
 ('james ',)]

In [15]:
print_five_closest(('harry ', ' potter '))


----------------------
('harry ', ' potter ')
----------------------
('ron ', ' weasley ')
('hermione ', ' granger ')
('hagrid ', ' rubeus ')
('snape ', ' severus ')
('dudley ', ' duddy ')

In [16]:
print_five_closest(('voldemort ', ' lord ', ' you-know-who '))


------------------------------------------
('voldemort ', ' lord ', ' you-know-who ')
------------------------------------------
('harry ', ' potter ')
('snape ', ' severus ')
('quirrell ',)
('dumbledore ', ' albus ')
('ron ', ' weasley ')

In [17]:
print_five_closest(('crabbe ',))


------------
('crabbe ',)
------------
('goyle ',)
('draco ', ' malfoy ')
('harry ', ' potter ')
('neville ', ' longbottom ')
('george ',)

In [18]:
print_five_closest(('fred ',))


----------
('fred ',)
----------
('george ',)
('ron ', ' weasley ')
('harry ', ' potter ')
('adrian pucey ',)
('katie bell ',)

Hopefully that's enough proof that bookworm does its job well.
In the next notebook we'll examine how we can automatically extract character names from novels in order to automate the entirety of the bookworm process.


